Abstract

Previous work has proposed adding compression techniques to a variety of
architectural styles to reduce instruction memory requirements. It is not
immediately clear how these results apply to DSP architectures. DSP
instructions are longer and potentially more varied, which can worsen the
compression ratio. Our results demonstrate that DSP programs do provide
sufficient repetition for compression algorithms. We propose a compression
method and apply it to SHARC, a popular DSP architecture. Even with a very
simple compression algorithm, it is possible to halve the instruction memory
requirements.
1  Introduction

Architectures for digital signal processing (DSP) have adopted several
characteristics of Very Long Instruction Word (VLIW) architectures, including
wide instruction words. The cost of using the explicit parallelism of VLIW is
much larger code size. Beyond the classical optimizations used to achieve
smaller programs, compression can shrink program size by exploiting repetition
found at the instruction level. Several compression techniques have been
proposed for general-purpose architectures [Wolfe92, Kozuch94, Fraser95,
Liao95, Benes97, Ernst97, Kirovski97, Lefurgy97, Wolf97, Aranjo98]. Previous
work focused on using short variable-length codewords and on making each
codeword carry more by allowing it to decode to a list of instructions. It is
not known whether such compression methods can be used on DSP architectures.
DSP instructions can hold multiple independent operations, which potentially
increases variance in the instruction bit patterns. Our previous study
[Lefurgy98] noted that most compression can be attributed to
single-instruction patterns. We use this idea to show that programs for DSP
architectures are highly compressible. Compression for DSP has two important
ramifications. First, performance can be traded for small code size. Second,
small code size reduces the frequency at which overlays are performed and
therefore can vastly improve execution time.

The organization of this paper is as follows. Section 2 reviews previous work
in code compression. We present our compression method in section 3. Our
experimental results are presented in section 4. In section 5, we discuss some
implications of the results. Finally, section 6 contains our conclusions.

2  Previous  work 

There have been several recent works on code compression. The Compressed Code
RISC Processor (CCRP) [Wolfe92, Kozuch94, Benes97] is a MIPS processor that
compresses instruction cache lines using Huffman coding. Dictionary
compression methods [Bell90] have been studied for several processors [Liao95,
Lefurgy97]. A software-managed compression cache that decompresses functions
on a cache miss has been proposed [Kirovski97]. Compression algorithms based
on operand factorization and Markov models have been suggested for
transmitting programs over networks [Ernst97]. A C compiler that produces
customized compact interpreters and byte-code has been demonstrated
[Fraser95]. Carmel [Sucher98] is a DSP architecture that uses a dictionary
compression technique. More complicated compression algorithms have combined
operand factorization with Huffman and arithmetic coding [Aranjo98,
Lekatsas98]. A VLIW program representation [Conte95] reduced program size by
eliminating NOP fields.

In a previous work [Lefurgy97], we used dictionary compression to reduce the
instruction memory footprint of embedded programs. We examined replacing
frequently used sequences of instructions with a codeword. The codeword served
as an index into a list of instruction sequences. Fetching and decoding the
codewords recovered the original sequence of instructions to execute. A
variable-length encoding using small codewords (8 bits, 12 bits, and 16 bits)
allowed us to compress PowerPC programs to 60% of their original size. We will
show that even simpler compression techniques can shrink SHARC [ADI] programs
considerably further.

3  Compression  architecture 

Our compression scheme takes advantage of the observation that the
instructions in programs are highly repetitive. Each unique instruction word
in the program is put in an instruction table. Each instruction in the program
is then replaced with an index into this table. Because the instruction words
are replaced with a shorter code and because the table overhead is usually
small compared to the program size, the compressed version is smaller than the
original. Instructions that appear only once in the program are problematic.
The original instruction in the instruction table and the index in the program
stream are together larger than the single original instruction, causing a
slight expansion from the native representation.
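
The scheme described above can be sketched in a few lines of Python. This is a
minimal illustration, not the authors' tool: instruction words are modeled as
integers, the names compress, decompress, and sizes_in_bits are hypothetical,
and the 48-bit word and 16-bit index widths are the SHARC values used in this
section.

```python
WORD_BITS = 48   # SHARC instruction word width
INDEX_BITS = 16  # width of the index that replaces each instruction


def compress(program):
    """Build the instruction table and the index stream for a program.

    `program` is a list of instruction words (modeled here as integers).
    """
    table = []      # unique instruction words, in first-seen order
    position = {}   # instruction word -> its index in `table`
    indexes = []    # one index per original instruction
    for word in program:
        if word not in position:
            position[word] = len(table)
            table.append(word)
        indexes.append(position[word])
    return table, indexes


def decompress(table, indexes):
    """Recover the original instruction stream from table and indexes."""
    return [table[i] for i in indexes]


def sizes_in_bits(program, table):
    """Return (original size, compressed size including the table)."""
    original = len(program) * WORD_BITS
    compressed = len(program) * INDEX_BITS + len(table) * WORD_BITS
    return original, compressed


# Toy program: two distinct words, five instructions in total.
program = [0xA1, 0xB2, 0xA1, 0xA1, 0xB2]
table, indexes = compress(program)
```

For the toy program, the native form takes 5 x 48 = 240 bits and the
compressed form takes 5 x 16 + 2 x 48 = 176 bits. An instruction that occurs
only once costs 48 + 16 = 64 bits instead of 48, which is the slight expansion
noted above.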

The SHARC pipeline is shown in Figure 1. SHARC typically uses the Program
Memory bus to fetch instructions and uses the Data Memory bus to fetch data.
However, it can also use these busses for dual data access. When this happens,
instructions are executed from the instruction cache so that the Program
Memory bus can be used for data fetch. The modification of SHARC for
compressed programs is given in Figure 2. We augment the three-stage SHARC
pipeline by adding a pre-fetch stage. First, the pre-fetch stage retrieves the
16-bit instruction index from the external memory. The instruction table
address register holds the location of the instruction table in the internal
memory. Adding the contents of this register to the index forms the address of
the SHARC instruction. Second, the fetch stage uses this address to get the
48-bit SHARC instruction word. Finally, the instruction is issued to the
decode stage.
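
The two-step fetch can be modeled as follows. This is only a behavioral sketch
under simplifying assumptions: both memories are modeled as word-addressed
Python lists, the function names are hypothetical, and timing, caching, and
the bus structure are ignored.

```python
def prefetch(index_memory, pc):
    """Pre-fetch stage: read the 16-bit index stored at the program counter."""
    index = index_memory[pc]
    assert 0 <= index < (1 << 16), "index must fit in 16 bits"
    return index


def fetch(internal_memory, table_base, index):
    """Fetch stage: table address register + index gives the word's address."""
    return internal_memory[table_base + index]


def next_instruction(index_memory, internal_memory, table_base, pc):
    """Return the 48-bit instruction word that is issued to decode."""
    return fetch(internal_memory, table_base, prefetch(index_memory, pc))


# Internal memory with a three-entry instruction table starting at address 4.
internal_memory = [0, 0, 0, 0, 0x111111111111, 0x222222222222, 0x333333333333]
index_memory = [2, 0, 1, 0]   # compressed program: one index per instruction
issued = [next_instruction(index_memory, internal_memory, 4, pc)
          for pc in range(len(index_memory))]
```

Here table_base plays the role of the instruction table address register, and
the list issued holds the native instruction words in program order.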

There are three costs for adding the pre-fetch stage. First, an extra internal
memory bus must be added to support simultaneous access to the index memory,
program memory, and data memories. SHARC uses dual-ported SRAM to achieve
simultaneous accesses over the program and data busses. Instead of adding
another port to SRAM for the index bus, a separate SRAM block could be
dedicated to index memory. Second, the pre-fetch stage adds a third branch
delay slot. Last, one register must be added to hold the address of the
instruction table.

Figure 1: SHARC pipeline

Top shows program fetch during execution of a single data access instruction.
Bottom shows program fetch during execution of a dual data access instruction.
Instructions are fetched from cache when the execution unit uses both Program
and Data busses to fetch data.

Figure 2: Compressed program pipeline

Top shows program fetch during single data path instruction execution. Bottom
shows program fetch during dual data path execution.


When data and program accesses compete for use of the program bus, SHARC puts
the conflicting instruction in the instruction cache. Future references to the
same instruction address can use the I-cache and allow the program bus to be
used for data. This feature allows loops with instructions that use the
program bus for data access to execute without penalty due to bus contention.
This is extremely important for DSP algorithms, which tend to be composed of
small, computation-intensive loops. Our compression architecture retains this
valuable feature.

The 16-bit index limits a program to only 64K unique instructions. However,
programs that use more instructions can be accommodated. One alternative is to
add a mode-switching branch to the instruction set, similar to the one used in
ARM [ARM95, Turley95]. This would cause the fetch units to switch between
using indexes and normal SHARC instructions. In native mode, the pre-fetch
stage could be turned off. The fetch stage would use the program counter to
fetch SHARC instructions as usual. Another possibility is to encode different
parts of the program by using different instruction tables. By simply
re-loading the instruction table address register, an entirely new set of 64K
instructions can be used. This register can also be used to allow each program
in an embedded system to use its own instruction table so that the tables are
tuned to the instructions that the program actually uses.
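
One way to picture the multiple-table alternative is a greedy split of the
program into segments, each with its own table. The sketch below is an
assumption made for illustration only: the text does not say how a program
would be partitioned, the name split_into_tables is hypothetical, and the
sketch ignores where the instruction table address register would be reloaded
and how branches across segments would be handled.

```python
def split_into_tables(program, capacity=1 << 16):
    """Greedily cut `program` into segments with at most `capacity` unique words.

    Returns a list of (table, indexes) pairs, one pair per segment.
    """
    segments = []
    table, position, indexes = [], {}, []
    for word in program:
        if word not in position and len(table) == capacity:
            # The current table is full: close this segment and start a new one.
            segments.append((table, indexes))
            table, position, indexes = [], {}, []
        if word not in position:
            position[word] = len(table)
            table.append(word)
        indexes.append(position[word])
    if indexes:
        segments.append((table, indexes))
    return segments


# Tiny capacity so that the split is visible on a toy program.
segments = split_into_tables(["a", "b", "a", "c", "c", "a"], capacity=2)
```

With the default capacity of 65,536 entries, every index in every segment fits
in 16 bits. A word that appears in two segments is stored in both tables,
which is the price of this simple split.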

3.1  Branch  instructions 

In our previous work, we did not compress branch instructions because doing so
could affect instruction repetition in complicated ways. Using patterns of
only one instruction with a fixed-length encoding eliminates this problem.
Compressing a program moves all instructions to a different location. This
affects branches, which have index and address fields. Additionally, codewords
are smaller than the original instructions, so the instruction fetch mechanism
and branches must use a new alignment. Since we are using 16-bit codewords,
PC-relative branches and absolute branches now specify a 16-bit aligned
address. In this simple scheme, the index fields of the PC-relative branches
do not change, since the distance (number of instructions or codewords)
between the branch and its target is the same. Absolute branches, which the
compiler uses for function calls, must change to use the address of the new
location of the target function. However, all such branches that matched
before will also match after this transformation. Therefore, we can compress
branch instructions just like any other instruction.
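
A small sketch of this argument, under the assumption that addresses are
counted in instruction slots (one 48-bit word natively, one 16-bit codeword
when compressed) and that the program is simply relocated from old_base to
new_base. The function names, the addresses, and the tuple encoding of
branches are hypothetical.

```python
def relocate_absolute(target, old_base, new_base):
    """Map an absolute branch target to its address in the compressed program."""
    return new_base + (target - old_base)


def relative_offset(branch_addr, target_addr):
    """PC-relative distance, counted in instructions (or codewords)."""
    return target_addr - branch_addr


old_base, new_base = 0x2000, 0x0400

# Two call sites with identical absolute branches to the same function.
calls_before = [("call", 0x2040), ("call", 0x2040)]
calls_after = [(op, relocate_absolute(target, old_base, new_base))
               for op, target in calls_before]

# A PC-relative branch at 0x2010 targeting 0x2018, before and after relocation.
offset_before = relative_offset(0x2010, 0x2018)
offset_after = relative_offset(relocate_absolute(0x2010, old_base, new_base),
                               relocate_absolute(0x2018, old_base, new_base))
```

The two calls are rewritten to the same new target, so they still share one
table entry, and the PC-relative offset is identical before and after because
each instruction maps to exactly one codeword.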
Benchmark    Optimization  Static        Table size  Original size  Compressed    Compression
                           instructions  (entries)   (bytes)        size (bytes)  ratio
mpeg2enc     none          28,832        7,167       172,992        100,666       58.2%
             -O1           26,537        8,118       159,222        101,782       63.9%
go           none          81,343        8,564       488,058        214,070       43.9%
             -O1           76,424        12,931      458,544        230,434       50.3%
ghostscript  none          352,525       33,322      2,115,150      904,982       42.8%
             -O1           310,869       49,734      1,865,214      920,142       49.3%

Table 1: Baseline results


4  Results 

In this section we integrate our compression technique into the SHARC
ADSP-2106x instruction set. We use benchmarks from SPEC CINT95 [SPEC95] and
MediaBench [Lee97]. These benchmarks are compiled with the VisualDSP compiler
from Analog Devices. The portions of the benchmarks for file I/O were removed
since they are not supported by the compiler’s libraries. Our results include
both application and library code. All compressed program sizes include the
overhead of the dictionary. Compression ratio is used to measure the amount of
compressibility.

    compression ratio = compressed size / original size        (Eq. 1)

Table 1 shows the results for our basic compression method. Each benchmark was
compiled with and without optimizations. We only use “-O1” optimization
because higher levels of optimization exposed bugs in the compiler. The Table
Size column is the number of entries in the instruction table. There is one
entry for each unique instruction bit pattern in the program. Compressed Size
is the combined size of the indexes and the instruction table.
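
As a worked check, the sizes in Table 1 follow directly from the instruction
and table counts, given 48-bit (6-byte) instruction words and 16-bit (2-byte)
indexes. The helper names below are illustrative; the counts are copied from
Table 1.

```python
WORD_BYTES = 6    # 48-bit SHARC instruction word
INDEX_BYTES = 2   # 16-bit index


def original_size(instructions):
    """Native program size in bytes."""
    return instructions * WORD_BYTES


def compressed_size(instructions, table_entries):
    """Index stream plus instruction table, in bytes."""
    return instructions * INDEX_BYTES + table_entries * WORD_BYTES


def compression_ratio(instructions, table_entries):
    """Eq. 1, expressed as a percentage rounded to one decimal place."""
    ratio = compressed_size(instructions, table_entries) / original_size(instructions)
    return round(100 * ratio, 1)


# (benchmark, optimization): (static instructions, table entries), from Table 1
table1 = {
    ("mpeg2enc", "none"): (28_832, 7_167),
    ("mpeg2enc", "-O1"): (26_537, 8_118),
    ("go", "none"): (81_343, 8_564),
    ("go", "-O1"): (76_424, 12_931),
    ("ghostscript", "none"): (352_525, 33_322),
    ("ghostscript", "-O1"): (310_869, 49_734),
}

for (benchmark, optimization), (instructions, entries) in table1.items():
    print(benchmark, optimization,
          original_size(instructions),
          compressed_size(instructions, entries),
          compression_ratio(instructions, entries))
```

For mpeg2enc without optimization this gives 28,832 x 6 = 172,992 bytes
originally and 28,832 x 2 + 7,167 x 6 = 100,666 bytes compressed, a ratio of
58.2%, matching the first row of Table 1.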

Classical code optimizations are one way to attain a smaller code size. Using
some optimization on the benchmarks reduces the number of instructions, but it
also increases the table size. The table size increases because the number of
unique instructions increases when single-operation instructions are combined
into 2- and 3-operation instructions. In un-optimized code, instructions
contain only 1 operation and are more likely to match each other. The
reduction in the number of instructions in the optimized code did not offset
the increase in the table size. Therefore, the smallest representation was
attained by compressing un-optimized code.

The instruction tables contain many instructions that are used only once in
the entire program. One reason this happens is that the combination of
registers the register allocation algorithm uses for a particular instruction
may not match any other instruction. We can improve the compression ratios by
removing these unique instructions from the table. To accomplish this, we
select some instructions that can be represented in 16 bits and mix them in
with the index stream. These short instructions are coded with unused index
values. For this experiment, we selected the 8 most frequent ALU operations
for each benchmark to use as short instructions. The encoding of the index
stream is as follows. If an index begins with the bit 0, then the remaining 15
bits are the index into the instruction table. If the index begins with 1,
then the next 3 bits select an ALU operation in the SHARC. The remaining 12
bits are divided into three groups of 4 bits to select 3 registers for the ALU
operation. The 3-bit ALU operation field selects one entry from an 8-entry
table of SHARC ALU opcodes. This table could be programmable so that each
program could select the 8 best ALU instructions to help compression.
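
The mixed encoding can be decoded as sketched below. The bit layout follows
the description above; the order of the three register fields and the contents
of ALU_TABLE are assumptions made for illustration, since the text specifies
neither.

```python
# Hypothetical 8-entry table of ALU operations. A real table would hold SHARC
# ALU opcodes chosen per program; "op7" is a placeholder for the eighth entry.
ALU_TABLE = ["add", "mul", "sub", "pass", "comp", "inc", "dec", "op7"]


def decode_codeword(codeword, alu_table=ALU_TABLE):
    """Decode one 16-bit codeword from the mixed index stream."""
    assert 0 <= codeword < (1 << 16), "codeword must fit in 16 bits"
    if codeword >> 15 == 0:
        # Leading 0: the low 15 bits index the instruction table.
        return ("index", codeword & 0x7FFF)
    # Leading 1: 3-bit ALU selector followed by three 4-bit register fields.
    operation = alu_table[(codeword >> 12) & 0x7]
    registers = ((codeword >> 8) & 0xF, (codeword >> 4) & 0xF, codeword & 0xF)
    return ("alu", operation, registers)


table_entry = decode_codeword(0x0123)   # leading bit 0
short_alu = decode_codeword(0x9ABC)     # leading bit 1, selector 1, regs 10, 11, 12
```

Because the leading bit is reserved, only 15 bits remain for table indexes, so
a program using this encoding can address at most 32,768 table entries.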

Results for mixing ALU operations and indexes are presented in Table 2. Some
common ALU operations used are addition, multiplication, subtraction, pass,
compare, increment, and decrement. This encoding significantly reduces the
table size. However, for ghostscript with optimization, the table still has
46,498 entries, more than the 32,768 that a 15-bit index can address, so there
are too few unused index values to add the shortened ALU instructions. For the
other benchmarks, the compression ratio improves by between 1.2 and 3.7
percentage points.

5  Discussion 

For comparison, we also compressed the benchmarks with a nibble compression
algorithm [Lefurgy97]. This algorithm reduces the size of codewords (indexes)
to 8 bits, 12 bits, and 16 bits. Each codeword can represent a list of
instructions. However, branch instructions are not encoded. Instead, they are
prefixed with a 4-bit escape nibble to differentiate them from the codewords.
Table 3 shows the results and compares them to the baseline method. This
demonstrates that more complicated schemes can attain better compression
ratios. Interestingly, the compression ratios for the larger benchmarks are
quite similar, which shows that even simple compression algorithms can be
effective. Using the shorter codewords instead of compressing branches yielded
slightly better compression ratios for the larger benchmarks.

Benchmark    Optimization  Table size  Table size     Compressed    Compression  Compression ratio
                           (entries)   change from    size (bytes)  ratio        change from
                                       baseline                                  baseline
mpeg2enc     none          6,107       -1,060         94,306        54.5%        -3.7%
             -O1           7,323       -795           97,012        60.9%        -3.0%
go           none          7,213       -1,351         205,964       42.2%        -1.7%
             -O1           11,728      -1,203         223,216       48.7%        -1.6%
ghostscript  none          29,183      -4,139         880,148       41.6%        -1.2%
             -O1           46,498      -3,236         N/A           N/A          N/A

Table 2: Addition of short instruction words

Benchmark    Optimization  Compressed    Compression  Compression ratio
                           size (bytes)  ratio        change from baseline
mpeg2enc     none          89,647        51.8%        -6.4%
             -O1           88,541        55.6%        -8.3%
go           none          196,260       40.2%        -3.7%
             -O1           203,632       43.3%        -7.0%
ghostscript  none          883,789       41.8%        -1.0%
             -O1           852,871       45.7%        -3.6%

Table 3: Nibble encoding

In embedded systems that must use external memory to store programs, overlays
are an important way to use internal memory effectively and achieve high
performance. Code compression can help such systems achieve even greater
performance. Smaller code size reduces the frequency at which overlays must be
used, since a larger portion of the program can fit in internal memory. In
addition, loading a compressed function from external memory requires less
time than loading a non-compressed function.

6  Conclusions 

We have demonstrated that even simple compression methods can be highly
effective at reducing code sizes in DSP programs. Compressing only single
instructions to a fixed-length code allows a simple decompression mechanism
that has minimal impact on the SHARC architecture. Our method can compress
programs to half their original size while allowing the hand-coded numerical
loops that are important in DSP algorithms to run at native speeds.